Time Series Classification with InceptionFCN

Deep neural networks (DNNs) have proven effective in computer vision and data classification, with an increasing number of successful applications. Time series classification (TSC) has been one of the challenging problems in data mining over the last decade, and a significant body of research has proposed various solutions, including algorithm-based approaches as well as machine and deep learning approaches. This paper focuses on combining two well-known deep learning techniques, namely the Inception module and the fully convolutional network (FCN). The proposed method proved to be more efficient than the previous state-of-the-art InceptionTime method. We tested our model on the univariate TSC benchmark (the UCR/UEA archive), which includes 85 time-series datasets, and showed that our network outperforms InceptionTime in terms of both training time and overall accuracy on the UCR archive.


Introduction
Time series classification started to attract attention in the early 2000s, and, at that time, the proposed methods were based on traditional, algorithm-based approaches. More recently, applying deep learning techniques to time-series data has become popular among researchers. With the growth of data, the need to process that data is also increasing; therefore, there has been high demand among data mining researchers to extract, analyze, and understand time-series data. With the adoption of deep neural networks, the capability to classify such data has also increased significantly. The idea of deep learning was first introduced by Yann LeCun [1] in 1998, when a multilevel artificial neural network was developed to classify handwritten digits, but it truly gained popularity after the introduction of AlexNet [2] for image classification. Since then, a tremendous amount of research has been conducted, leading to the creation of successful algorithms. Deep neural networks are sophisticated and capable of solving computer vision problems by handling multidimensional data, such as image spatial information, for classification or localization. Most of the success of deep learning in image recognition tasks has been attributed to the "depth" of the architectures. Time series data, however, are computationally simpler and require the computation of only sequenced data.
The UCR open-source archive, the largest time-series (TS) data collection, is used as a benchmark by most researchers to test model performance. The UCR archive currently contains 156 datasets that were initially collected and normalized [3]. The datasets have different data lengths and numbers of classes. A good example of the different methodologies for classifying TS data is the work of Hassan Ismail Fawaz et al. [4], where the authors implemented various deep learning models, including the multilayer perceptron (MLP), encoder, residual network (ResNet), and fully convolutional network (FCN), and compared the results on every dataset in the UCR archive for both univariate and multivariate data. Another upgraded network was proposed by the same authors [5], where they first adapted the Inception module for time series classification (TSC); each module has the same architecture but different randomly initialized weight values. The main idea behind the Inception module is to apply several filters of varying length simultaneously to a continuous time-series input. We investigated the best related work and found that there is still room for improvement. In this study, we decrease the training time without an accuracy drop by applying a larger number of filters to the fundamental layers inside the Inception block and by modifying the hyperparameters. Instead, we added computational overhead to amplify the performance. Additionally, we implemented a deeper FCN with varying convolutional filters, finetuned the network by adding pooling layers, and included a dropout layer to decrease the likelihood of overfitting and to make the network less sensitive to small fluctuations in the time-series data.
The key contributions are summarized as follows:
1. Inception block modification: we modified the existing Inception module by finetuning the parameters for the convolutional and max-pooling layers. We created narrower convolutional layers than the original Inception block by comprising more kernels per layer. These changes speed up the training due to the decrease in the number of parameters and FLOPs [6].
2. Aggregation: we combined the deeper FCN block with the Inception module to boost the classification performance. We sequentially trained the initial time-series features on the Inception and FCN modules, then merged the outputs with an adding layer at the end of the network.

Although each contribution is easy to implement, we believe this is the first work that combines both methods.
The rest of the paper is organized as follows. Section 2 briefly describes the UCR archive data collection and discusses various related work in traditional and deep learning methods. We explain the proposed network architecture in Section 3. Section 4 demonstrates the experimental results and illustrates numerical comparisons. Section 5 discusses our findings and the exclusion of some related work. Finally, Section 6 concludes the paper and discusses the possible future work.

UCR Archive
The UCR archive was first introduced in 2002, and, since then, it has been an important resource for data miners. The archive previously consisted of 46 datasets before a significant update in 2015, when researchers increased the number of datasets to 85. Additionally, the data were normalized and denoised. The latest update was completed in 2018 [7], with an expansion from 85 to 128 datasets. Sequential data were annotated, and a specific number was assigned as a class label. The number of classes per dataset varies from 2 to 60. According to Bing H. et al. [8], the datasets can be categorized by the following time-series definitions:

Definition 1. A univariate time series is an ordered set of real-valued variables, $T = [T_1, T_2, \ldots, T_n]$, where n is the sequence length and there is only one dimension X. Only local properties are used in a time series, and subsequences are enclosed.

Definition 2. A multivariate time series has more dimensions: it consists of n ordered elements, $T = [T_1, T_2, \ldots, T_n]$ with $T_i \in \mathbb{R}^X$, where X is the number of dimensions.

Definition 3. A dataset is a set of data pairs in which each time series can be either univariate or multivariate, that is, X ≥ 1; thus, the data can also be presented in sets.

Machine-Learning-Based Classification Algorithms
There are diverse classification algorithms; however, we describe three main categories for TSC, with details of a specific algorithm for each. These categories are based on distances (DTW), intervals (TSF), and dictionaries (BOSS).

K-means clustering with dynamic time warping is a well-known unsupervised learning method that constructs clusters of data by splitting samples into k groups and minimizing the sum of squares within each cluster. K-means algorithms are not always effective because they use the Euclidean distance, which often produces pessimistic similarity measures when it encounters distortion in the time axis [9]. Replacing that distance with dynamic time warping (DTW) is one way to enhance the classification performance. DTW finds the optimal non-linear alignment between two time series. DTW between two time series a and b, both in the domain of P, can be formulated as an optimization problem over all admissible alignments between them. Thanawin Rakthanmanon et al. [10] implemented the DTW algorithm to classify earlier UCR archives in 2012.

TimeSeriesForest (TSF) is a supervised learning method that is an ensemble of decision trees. Houtao Deng et al. [11] created a tree-ensemble method that employs a combination of entropy gain and a distance measure. TSF randomly samples features at each tree node, has computational complexity linear in the length of the time series, and can be built using parallel computing techniques. TSF is computationally efficient and outperforms DTW.
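The DTW alignment described above can be illustrated with a short dynamic-programming sketch. This is a minimal version under stated assumptions, not the formulation used in [9] or [10]: the squared pointwise cost, the final square root, and the function names `dtw_distance` and `euclidean_distance` are all choices made here for illustration.

```python
import math


def dtw_distance(a, b):
    """DTW distance between two 1D sequences (assumed squared pointwise cost)."""
    n, m = len(a), len(b)
    # cost[i][j]: minimal accumulated cost of aligning a[:i] with b[:j]
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = (a[i - 1] - b[j - 1]) ** 2
            # extend the cheapest of the three admissible predecessor alignments
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return math.sqrt(cost[n][m])


def euclidean_distance(a, b):
    """Lock-step Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# The same peak, shifted by one timestep (made-up example data).
a = [0.0, 0.0, 1.0, 2.0, 1.0, 0.0]
b = [0.0, 1.0, 2.0, 1.0, 0.0, 0.0]
print(dtw_distance(a, b))        # 0.0
print(euclidean_distance(a, b))  # 2.0
```

In this example the two series differ only by a shift along the time axis, so the warped alignment has zero cost while the Euclidean distance does not, which is the pessimistic behaviour mentioned in the text.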
Bag-of-SFA-Symbols (BOSS) [12] is another method for classification that combines the extraction of substructures with the tolerance to data and noise reduction for time-series. The novelty in this approach was the Symbolic Fourier Approximations (SFA), which performs approximation and quantization operations [12]. The BOSS classifier is based on 1-nearest-neighbour (1-NN) classification. It searches for the 1-NN within a set of samples by minimizing the distance between the query and the input data.
However, with the advancement of deep learning approaches, the above-mentioned distance (DTW), interval (TSF), and dictionary (BOSS) methods are used less frequently.

Deep-Learning-Based TSC
All the algorithms listed above require feature engineering as a separate task before classification, which implies a greater loss of some necessary information during processing. On the other hand, in DL algorithms, features are learned directly via network training. Convolutional neural networks (CNNs) show promising results for more complex computations, such as image classification and object recognition. However, unlike an image, time-series input is fed to a convolutional layer with one-dimensional filters that can extract continuous discriminant features. In this section, we describe the most widely used DL techniques:
1. Multi-layer perceptron (MLP);
2. Fully convolutional network (FCN);
3. Residual network (ResNet);
4. InceptionTime.

Multi-Layer Perceptron for Classification
Sensors 2022, 22, 157

The feedforward neural network consists of multiple layers of neurons (perceptrons). Each neuron computes the weighted sum of its input values and applies an activation function (sigmoid, tanh, ReLU) to the result. MLP is a fully connected network in which the output is obtained by computing the activations of the perceptron layers in sequence. Zhiguang Wang et al. [13] used a basic MLP containing three fully connected layers with 500 neurons in each layer. They used dropout for each layer to improve generalizability. The ReLU activation function is used to prevent saturation of the gradient in a deeper network. According to them, each layer block combines dropout, a fully connected layer, and the ReLU activation.
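The layer block described above can be written compactly. The following is a plausible form only, assuming that dropout, the fully connected transformation, and ReLU are applied in that order; the symbols p (dropout rate), W (weights), b (bias), and h (block output) are notation introduced here for illustration.

```latex
\tilde{x} = f_{\mathrm{dropout},\,p}(x), \qquad
y = W \tilde{x} + b, \qquad
h = \mathrm{ReLU}(y)
```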

Fully Convolutional Networks for Classification
Convolutional networks are mostly applicable to images or multivariate time series, as they capture temporal and spatial patterns through filters and assign importance values to those patterns. Generally, FCNs have convolutional, batch normalization, pooling, and fully connected layers. The convolution is defined by a set of filters that are fixed-size matrices. According to Hassan Ismail Fawaz et al., FCN shows competitive quality and efficiency for TSC. FCN outperforms all the known deep learning methods, including ResNet. FCN for TSC was first introduced by Zhiguang Wang et al. [13], who composed the network of three convolutional blocks and performed three consecutive operations in each block: convolution, batch normalization, and the ReLU activation function. The output of the third and last block is averaged with global average pooling to reduce the number of parameters over the whole time dimension.
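To make the convolution and global average pooling steps concrete, the following is a minimal pure-Python sketch of one simplified block on a univariate series. It is not the implementation of [13]: batch normalization is omitted for brevity, and the filter values and function names are hypothetical.

```python
def conv1d(series, kernel, bias=0.0):
    """'Valid' 1D convolution (cross-correlation) of a series with one filter."""
    k = len(kernel)
    return [
        sum(series[t + i] * kernel[i] for i in range(k)) + bias
        for t in range(len(series) - k + 1)
    ]


def relu(values):
    """Element-wise ReLU activation."""
    return [max(0.0, v) for v in values]


def global_average_pooling(feature_maps):
    """Average each feature map over the whole time dimension."""
    return [sum(fm) / len(fm) for fm in feature_maps]


series = [0.0, 1.0, 2.0, 3.0, 2.0, 1.0, 0.0]     # made-up input
filters = [[1.0, 0.0, -1.0], [-1.0, 0.0, 1.0]]   # hypothetical filter weights
feature_maps = [relu(conv1d(series, f)) for f in filters]
pooled = global_average_pooling(feature_maps)
print(pooled)  # [0.8, 0.8]
```

Global average pooling yields one value per filter regardless of the series length, which is why the block can accept inputs of different sizes without a flattening layer.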
Baydadaev et al. [14] proposed using an FCN for a custom vestibulo-ocular reflex (VOR) dataset. VOR data are fixed-length multivariate time-series data that give information about head and eye velocities. The classification of the VOR data is based on the output gain; however, the classification was difficult because of frequent irrelevant data that are considered noise or artifacts. In that work, we proposed a 1D impulse classification network (ICN) that could classify signals with 95% accuracy. Generally, the ICN takes fixed-length signal data as input, and its performance depends on the amount of time-series data, the quality of features, and the network architecture.

Residual Networks for Classification
Residual networks are very deep structured networks that are easier to optimize. ResNet goes deeper by learning residual functions instead of unreferenced functions, with a shortcut connection added to each residual block that enables the gradient to flow directly through the bottom layers. ResNet for TSC was originally proposed by Wang et al. (2017) and was modified by Hassan Ismail Fawaz et al. As shown in Figure 1, the network architecture consists of three residual blocks, each with a convolutional layer, a batch normalization layer, and a ReLU activation function, followed by a global average pooling layer with a softmax classifier at the bottom of the network. The convolutional layers in the residual blocks have 64, 128, and 128 filters, with fixed filter lengths of 8, 5, and 3 for the first, second, and third blocks, respectively. ResNet has a constant number of parameters across datasets.

InceptionTime for Classification
Saeed Karimi-Bidhendi et al. mapped time-series data into Gramian angular difference field (GADF) images and used a pretrained Inception v3 model to map those images into a 2048-dimensional vector space. Then, they applied an MLP to classify the time-series data [15]. Hassan Ismail Fawaz et al., however, proposed the state-of-the-art InceptionTime network, which works purely on 1D time-series data. InceptionTime is a combination of five Inception networks that use the concept of the receptive field (RF), which is defined as the size of the region in the input that produces the feature. In CNNs, each convolutional unit depends only on a local region of the input, and the RF concept is used extensively in image segmentation [16]. In our case, however, the input is 1D time-series data, and the RF is computed by the formula:

$$RF = 1 + \sum_{i=1}^{n} (x_i - 1)$$

where n is the network depth, and $x_i$ is the filter length, with $i \in [1, n]$. The authors claimed that increasing the filter length for all the layers increases the RF for each network layer. Predictions made by networks with different initializations are computed by the formula:

$$y_{i,c} = \frac{1}{n} \sum_{j=1}^{n} f_c(x_i, \theta_j)$$

where $y_{i,c}$ denotes the output of the network for data $x_i$, which can be either univariate or multivariate, belonging to a class c; $\theta_j$ denotes the weights of the j-th model; and the function $f_c$ is averaged over n models.
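Both computations in this subsection are simple enough to sketch in a few lines. The example assumes the usual receptive-field expression for stacked stride-1 convolutions, RF = 1 + Σ(x_i − 1), and a plain arithmetic mean over the models; the function names and the example values are hypothetical and do not reproduce any published configuration.

```python
def receptive_field(filter_lengths):
    """RF of stacked stride-1 convolutions: 1 + sum of (filter length - 1)."""
    return 1 + sum(x - 1 for x in filter_lengths)


def ensemble_average(model_outputs):
    """Average the per-class outputs of n models for a single input."""
    n = len(model_outputs)
    num_classes = len(model_outputs[0])
    return [sum(out[c] for out in model_outputs) / n for c in range(num_classes)]


# Longer filters or more layers both enlarge the receptive field.
print(receptive_field([10, 10]))   # 19
print(receptive_field([40] * 6))   # 235

# Four hypothetical models voting on a two-class problem.
outputs = [[0.5, 0.5], [1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
print(ensemble_average(outputs))   # [0.5, 0.5]
```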


Network Architecture
Our proposed method has similarities with the InceptionTime method. The main difference is that we separated the network into two parts: the inception module and the shortcut module. The inception module is composed of two blocks. A bottleneck layer with a linear activation function is used at the top of the network to reduce the size of the input tensor; it is a 1D convolutional layer with an input size of 64 and a 1 × 1 kernel. The rest of the block is similar to the original inception block and consists of three convolutional layers with multiple filters of different sizes. Next, we added a max-pooling layer to make our model stable to small feature translations, meaning that the output of the max-pooling layer changes little when the features in the feature map are shifted slightly. Furthermore, another convolutional layer with a sliding 1 × 1 filter was added to extract hidden features. At the end of the block, there is a concatenation layer. We reduced the number of inception blocks from three to two, and, unlike the InceptionTime network, our InceptionFCN has a much smaller number of trainable parameters, as we applied convolutions of length {10, 20} instead of the {40, 20, 10} originally proposed. We avoided an overparameterized network due to the risk of overfitting. Fawaz et al. [5] created several models in their architectural hyperparameter study, where they modified the parameters of the inception blocks. In our proposed work, however, we chose hyperparameters for the inception block that keep the training time and accuracy in a good trade-off.

In the shortcut module, we applied a deeper FCN, used in another study [14], with slight architectural changes. The implemented FCN consists of six 1D convolutional layers of the same size but with different kernel lengths. The use of the deeper FCN compensates for the removal of one inception block in terms of overall network performance. The initial data from the dataset are passed through the 1 × 1 kernel to each of the 128 filters of the first convolutional layer as an input. 1D convolutions can be performed to reach the desired number of labels. The loss of the FCN can be calculated by averaging the cross-entropy over every timestep and mini-batch. At the bottom of the shortcut block, the last layer adds the output tensor from the Inception module to the output tensor from the FCN module. Then, we perform global average pooling on the summed tensor, and, lastly, we apply dropout to exclude the features from half of the nodes. An overview of the Inception module and FCN block is shown in Figure 2. Each convolutional layer is followed by batch normalization and a ReLU activation function.
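The effect of the shorter convolution lengths on the parameter count can be illustrated with a back-of-the-envelope calculation. This sketch counts only the weights of the parallel convolutions inside one inception block; the channel widths (`IN_CHANNELS`, `FILTERS`) are hypothetical values held equal for both variants purely for comparison, and they are not the settings reported for either network.

```python
def conv1d_weights(kernel_length, in_channels, out_channels):
    """Number of weights in one 1D convolution (bias terms excluded)."""
    return kernel_length * in_channels * out_channels


def parallel_conv_weights(kernel_lengths, in_channels, filters_per_branch):
    """Total weights of the parallel convolution branches in one block."""
    return sum(
        conv1d_weights(k, in_channels, filters_per_branch) for k in kernel_lengths
    )


IN_CHANNELS = 32  # hypothetical bottleneck output width
FILTERS = 32      # hypothetical number of filters per branch

ours = parallel_conv_weights([10, 20], IN_CHANNELS, FILTERS)
reference = parallel_conv_weights([40, 20, 10], IN_CHANNELS, FILTERS)
print(ours, reference)             # 30720 71680
print(round(ours / reference, 2))  # 0.43
```

Under these assumed widths, dropping the length-40 branch removes more than half of the convolution weights in the block, since the weight count grows linearly with the kernel length.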

Training the Network with UCR Archive
Before training the network, the data from the UCR archive must be preprocessed due to the diversity in the number of classes, the data distribution, and intensity. We considered only data matching the first definition in Section 2, that is, univariate time series; therefore, we used the UCR archive consisting of 85 univariate datasets to fairly compare our experimental results with the existing methods. To bring the data to a uniform scale, we normalized the input data to the range of 0.0 to 0.1. We nullified the unavailable timestep values inside some datasets to maintain the integrity of the incoming data. Since the amount of trainable data is very large, we used multithreading (parallelism) to speed up the preprocessing step. Our InceptionFCN network is scalable in the input layer, as the input sizes are different in every dataset. InceptionFCN was trained on a computer with a single RTX2080 GPU; therefore, the training time may differ from that of Hassan Ismail Fawaz et al., who used over 60 GPUs for training/testing. We used 1600 epochs for each class during training. Each dataset has a different training time since the input sizes and the numbers of classes are different. We included early stopping with a patience of 60 epochs. The median test accuracy was used as an evaluation metric. For comparison, we also trained the InceptionTime network on our machine with the hyperparameters presented by Hassan Ismail Fawaz et al. to check the processing time for both networks. We managed to decrease the training time by reducing the number of trainable parameters. Using an empirical trial-and-error procedure, we selected the hyperparameters shown in Table 1 for training our network.
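The normalization and missing-value handling described above might look like the following sketch. It assumes min-max scaling per series and that unavailable timesteps are encoded as NaN and replaced with zero; the function name `preprocess` and its parameters are hypothetical, and the target range is passed in explicitly rather than fixed.

```python
import math


def preprocess(series, low, high, fill_value=0.0):
    """Min-max scale a series into [low, high]; NaN timesteps become fill_value."""
    observed = [v for v in series if not math.isnan(v)]
    if not observed:
        # nothing to scale: every timestep is unavailable
        return [fill_value] * len(series)
    lo, hi = min(observed), max(observed)
    span = hi - lo
    scaled = []
    for v in series:
        if math.isnan(v):
            scaled.append(fill_value)   # nullify unavailable timesteps
        elif span == 0:
            scaled.append(low)          # constant series: map to the lower bound
        else:
            scaled.append(low + (v - lo) / span * (high - low))
    return scaled


raw = [2.0, float("nan"), 4.0, 6.0]  # made-up series with one missing value
print(preprocess(raw, low=0.0, high=0.1))
```

Because each series is processed independently, a step like this can be distributed across worker threads or processes, which matches the parallel preprocessing mentioned in the text.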

Evaluation Metric
We evaluated the overall accuracy, inference time, and FLOPs on the UCR-85 archive of the UCR benchmark. The characteristics (e.g., the number of classes, the size of the training/test sets, and the time-series input size) vary across datasets. As an example, the authors of the UCR archive provided the per class error (PCE) for three classification types, namely 1-NN Euclidean distance, DTW, and 1-NN DTW with no warping window, as a benchmark. We calculated the PCE for each dataset to evaluate the classification metric on multiple datasets. PCE is found with the formula:

$$PCE = \frac{1 - acc}{n}$$

where n is the number of classes and acc refers to the classification accuracy.
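As a small worked example of the metric, the sketch below assumes PCE = (1 − acc) / n for a single dataset and, as an additional assumption not stated in the text, summarizes several datasets by the arithmetic mean of their PCE values. The accuracies and class counts are made up for illustration.

```python
def per_class_error(accuracy, num_classes):
    """PCE of one dataset: error rate divided by the number of classes."""
    return (1.0 - accuracy) / num_classes


def mean_per_class_error(results):
    """Mean PCE over datasets; results holds (accuracy, num_classes) pairs."""
    return sum(per_class_error(a, c) for a, c in results) / len(results)


results = [(0.75, 2), (0.5, 5)]  # hypothetical (accuracy, number of classes)
print([per_class_error(a, c) for a, c in results])  # [0.125, 0.1]
print(mean_per_class_error(results))
```

Dividing by the number of classes keeps datasets with many classes, where a given error rate is easier to incur, from dominating the comparison.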

Numerical Results and Comparison
To keep the comparison fair, we evaluated our network performance against the four best DL methods that have claimed SOTA results in recent years: InceptionTime, MLP, FCN, and ResNet. We created a level playing field by training every network on our single-GPU machine. Our network performed the best in both accuracy and performance tests, having the lowest PCE rate on the UCR-85 archive. From Table 2, we can see that our proposed network outperformed previous well-known methods by achieving higher accuracy for most of the datasets. In total, our model achieved a Win/Tie/Loss record of 52/9/17 out of 85 datasets, which is significant. InceptionTime also showed competitively good performance. However, our network is significantly faster to train without affecting the overall performance of the network. Table 3 shows the difference in the computational cost (FLOPs), which is proportional to inference and training time. Having fewer total parameters than InceptionTime lowers the risk of overfitting the model, even on larger datasets. Our proposed method trains significantly faster and has an architecture about two times smaller than that of InceptionTime (135 M vs. 309 M FLOPs) due to the smaller number of inception blocks and kernels.

Wilcoxon Signed-Ranks Test
We applied the Wilcoxon signed-ranks test, which is a nonparametric statistical hypothesis test. The test ranks the differences in the performances of two classifiers (ours versus InceptionTime) for each dataset and compares the ranks for the positive and negative differences (R+ and R−). This test helps evaluate the difference between the two methods. We set the null hypothesis as the two methods perform equally well for all 85 datasets, and the alternative hypothesis is that our model performs better than InceptionTime.
Let $\partial_i$ be the difference between the two classifiers' performance on the i-th of the n datasets. The differences are ranked from the lowest absolute value to the highest. In the case of ties, average ranks are assigned to those datasets. From this, we can calculate R+ and R− by the following equations:

$$R^{+} = \sum_{\partial_i > 0} \mathrm{rank}(\partial_i) + \frac{1}{2} \sum_{\partial_i = 0} \mathrm{rank}(\partial_i) \qquad (7)$$

$$R^{-} = \sum_{\partial_i < 0} \mathrm{rank}(\partial_i) + \frac{1}{2} \sum_{\partial_i = 0} \mathrm{rank}(\partial_i) \qquad (8)$$

Let T be the smaller of the sums, T = min(R+, R−). General statistics references include exact critical values for T for small n; for a larger number of datasets, the statistic

$$z = \frac{T - \frac{1}{4} n(n + 1)}{\sqrt{\frac{1}{24} n(n + 1)(2n + 1)}} \qquad (9)$$

is approximately normally distributed.

Table 4 shows a performance comparison between InceptionFCN and InceptionTime. By calculating the differences, we can compute the ranks that are used to find the value of T. The sum of positive ranks is R+ = 954.5 (Equation (7)), and the sum of negative ranks is R− = 2710.5 (Equation (8)). This shows that our model outperforms InceptionTime because the difference is calculated by subtracting the InceptionTime accuracy from InceptionFCN accuracy. From R− and R+, we find that the value of T is 954.5. Note that the number of datasets used in this experiment is n = 85. Using Equation (9), we find that z is equal to −3.83, which is smaller than −1.96 for α = 0.05 using the z-score table. Therefore, we reject the null hypothesis and conclude that our method achieves better results compared to InceptionTime, as proposed.

The next step is to identify the statistical significance of this difference between the two methods. One way is to use a sign test (a form of the binomial test [17]) for wins and losses between the two best methods (since InceptionTime performs better than the other existing methods, we only compare the difference between InceptionTime and our InceptionFCN for this experiment). Under the null hypothesis, the two compared methods should perform equally, which means each should win on n/2 datasets. Since the number of wins is distributed according to a binomial distribution [18], we can use the z-test.
For example, if one method wins on at least $\omega_\alpha$ datasets, as given in Equation (10), this method is considered significantly better than the other with p < 0.05:

$$\omega_\alpha = \frac{n}{2} + z_\alpha \frac{\sqrt{n}}{2} \qquad (10)$$

For n = 85 and $z_\alpha$ = 1.96, the value of $\omega_\alpha$ is 51.54. From Table 3, it is seen that our method wins on 55 datasets out of 85 (head-to-head comparison), which not only supports the alternative hypothesis (that our method outperforms InceptionTime) but also shows that the difference between these two methods is statistically significant with p < 0.05. Figure 3 shows the critical difference diagram for the compared methods.

Critical Difference Calculation for Multiple Classifiers
For the multiple classifier comparisons, we used the Wilcoxon-Holm post hoc test to determine which classifiers are significantly different from one another. The average arithmetic rank represented in Figure 3 shows that our InceptionFCN surpasses the well-known DL models. The critical difference diagram with Wilcoxon-Holm post hoc analysis provides evidence that adding a deeper FCN to finetuned inception blocks improves the overall accuracy on the UCR archive. The critical difference is found by the following formula [19]:

$$CD = q_\alpha \sqrt{\frac{k(k + 1)}{6n}} \qquad (11)$$

where k = 5 is the number of selected classifiers, $q_\alpha$ is the critical value based on the Studentized range statistic divided by $\sqrt{2}$, and n is the number of datasets.
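The statistics used in this section can be checked numerically with a short sketch. It assumes the standard forms of the signed-rank sums (zero differences split evenly between R+ and R−), the normal approximation z = (T − n(n+1)/4) / sqrt(n(n+1)(2n+1)/24), the sign-test threshold n/2 + z_α·sqrt(n)/2, and CD = q_α·sqrt(k(k+1)/(6n)). All function names are hypothetical, and the four-element difference list is made-up data.

```python
import math


def average_ranks(values):
    """Rank values ascending (1-based); tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # mean of the 1-based positions start..end
        for pos in range(start, end + 1):
            ranks[order[pos]] = shared
        start = end + 1
    return ranks


def signed_rank_sums(differences):
    """Return (R+, R-); ranks of zero differences are split evenly."""
    ranks = average_ranks([abs(d) for d in differences])
    r_plus = sum(r for d, r in zip(differences, ranks) if d > 0)
    r_minus = sum(r for d, r in zip(differences, ranks) if d < 0)
    zero = sum(r for d, r in zip(differences, ranks) if d == 0)
    return r_plus + zero / 2, r_minus + zero / 2


def wilcoxon_z(t, n):
    """Normal approximation of the Wilcoxon signed-ranks statistic."""
    mean = n * (n + 1) / 4
    std = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    return (t - mean) / std


def sign_test_threshold(n, z_alpha):
    """Minimum number of wins needed for significance in the sign test."""
    return n / 2 + z_alpha * math.sqrt(n) / 2


def critical_difference(q_alpha, k, n):
    """Critical difference in average ranks for k classifiers on n datasets."""
    return q_alpha * math.sqrt(k * (k + 1) / (6 * n))


print(signed_rank_sums([0.1, -0.2, 0.3, 0.0]))  # (6.5, 3.5)
print(round(wilcoxon_z(954.5, 85), 2))          # -3.83
print(round(sign_test_threshold(85, 1.96), 2))  # 51.54
# q_alpha = 2.728 is an assumed table value for k = 5 at alpha = 0.05.
print(critical_difference(2.728, 5, 85))
```

With T = 954.5 and n = 85, the sketch reproduces the z value and the win threshold quoted in the text.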

Discussion
There are other DL methods that we did not consider for this research, such as LSTM-FCN by Fazle Karim et al. [20]. Their results outperformed most of the existing methods, and this approach is regarded as one of the first choices for addressing the DL-based time-series classification problem. However, the computational complexity is very high, particularly when many subsequent layers are used, as LSTM networks can learn long-term relationships that go through geometric feature evolutions [21]. Therefore, a vast complexity increase was unacceptable for our research. For similar reasons, we did not consider implementing the ResNet architecture. ResNet would be redundant when used with the Inception block since both blocks perform similarly. Moreover, Fawaz et al. already compared the performance of InceptionTime with the ResNet model. Furthermore, our motivation for this research was to build a network with faster inference and a good time-accuracy trade-off.
Through various experiments, we showed that our proposed model could achieve a competitive performance while maintaining a smaller and more optimized network architecture. We believe this research will pave the way for many further research works directed at optimizing the network structure so that these methods can be implemented on small embedded devices with limited computational and memory resources.



Conclusions
In this paper, we enhanced a deep neural classifier for numerous time series classification tasks. Inspired by Inception-based research, we evolved the inception module to achieve high performance at a low computational cost (fewer FLOPs). We finetuned the network parameters and added a deeper shortcut FCN block to improve the performance for TSC. Our approach is shown to be highly scalable, as it can be applied to various time series collections of different sizes (e.g., the UCR archive). The proposed method also simplified the network training, as we reduced the number of parameters by a factor of two and conducted the current research using a single-GPU machine. Moreover, using the Wilcoxon signed-rank test and Wilcoxon-Holm post hoc analysis, we showed that the InceptionFCN model outperforms InceptionTime significantly.
However, all the experiments were focused on the univariate datasets. For future work, we would like to expand our network to perform on multivariate data archives, such as UCR-128 and UCR-156. Moreover, we look forward to applying our architectural advancements in deep neural networks to various computer vision tasks.

Data Availability Statement: Data available in a publicly accessible repository: the UCR time-series classification archive. Publicly available datasets were analyzed in this study. These data can be found at www.cs.ucr.edu/~eamonn/time_series_data/.